Advance Analytics with R (UG 21-24)
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at the Gokhale Institute of Politics and Economics.
I am an RStudio (Posit) certified tidyverse instructor.
I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
Dip our toes into classification techniques. How to apply and assess these methods.
References for this lecture:
“….often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.”
Default data

| default | student | balance | income |
|---|---|---|---|
| No | No | 729.5265 | 44361.625 |
| No | Yes | 817.1804 | 12106.135 |
| No | No | 1073.5492 | 31767.139 |
| No | No | 529.2506 | 35704.494 |
| No | No | 785.6559 | 38463.496 |
| No | Yes | 919.5885 | 7491.559 |
| No | No | 825.5133 | 24905.227 |
| No | Yes | 808.6675 | 17600.451 |
| No | No | 1161.0579 | 37468.529 |
| No | No | 0.0000 | 29275.268 |
| No | Yes | 0.0000 | 21871.073 |
| No | Yes | 1220.5838 | 13268.562 |
| No | No | 237.0451 | 28251.695 |
| No | No | 606.7423 | 44994.556 |
| No | No | 1112.9684 | 23810.174 |
| No | No | 286.2326 | 45042.413 |
| No | No | 0.0000 | 50265.312 |
| No | Yes | 527.5402 | 17636.540 |
| No | No | 485.9369 | 61566.106 |
| No | No | 1095.0727 | 26464.631 |
Default is our response (\(Y\)): Yes or No. I ran this: \(p(X) = \beta_0 + \beta_1X\), where \(X\) is balance.
library(dplyr)    # mutate()
library(ggplot2)  # plotting
library(ISLR2)    # Default data (also shipped in ISLR)

## make a 0/1 dummy for default
Default |>
  mutate(
    default_dumm = ifelse(default == "Yes", 1, 0)
  ) -> def_dum

## regress dummy over balance and plot
lm(default_dumm ~ balance, data = def_dum) |>
  broom::augment() |>
  ggplot(aes(balance, default_dumm)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(balance, .fitted), colour = "red") +
  labs(
    title = "Linear regression fit to qualitative response",
    subtitle = "Yes = 1, No = 0",
    y = "prob default status"
  ) +
  theme_minimal() -> plot_linear

## run the logistic regression; type.predict = "response" returns probabilities
glm(default_dumm ~ balance, data = def_dum, family = binomial) |>
  broom::augment(type.predict = "response") |>
  ggplot(aes(balance, default_dumm)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(balance, .fitted), colour = "red") +
  labs(
    title = "Logistic regression fit to qualitative response",
    subtitle = "Yes = 1, No = 0",
    y = "prob default status"
  ) +
  theme_minimal() -> logistic_plot

We saw that some fitted values in the linear model were negative.
We need a function that returns values in [0, 1].
\[p(X) = \frac{e^{(\beta_0 + \beta_1X)}}{1+e^{\beta_0 + \beta_1X}}\]
This is the logistic function; its coefficients are estimated by the maximum likelihood method.
Odds:
\[\frac{p(X)}{1-p(X)}\]
Log odds or logit:
\[\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X\]
Suppose the model \(logit(p(default)) = \beta_0 + \beta_1Balance\) gives the following results:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -10.651330614 | 0.3611573721 | -29.49221 | 3.623124e-191 |
| balance | 0.005498917 | 0.0002203702 | 24.95309 | 1.976602e-137 |
What is the probability of default for a balance of $5000?
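A minimal sketch of one way to answer this: plug the reported estimates into the logistic function. The names `b0`, `b1` and `p_default` are illustrative and not part of the original slides.

```r
# Coefficient estimates copied from the table above
b0 <- -10.651330614   # intercept
b1 <- 0.005498917     # slope on balance

# Logistic function: p(X) = exp(b0 + b1 * X) / (1 + exp(b0 + b1 * X))
p_default <- function(balance) {
  eta <- b0 + b1 * balance        # log odds (logit)
  exp(eta) / (1 + exp(eta))
}

p_default(5000)   # very close to 1
p_default(1000)   # well below 1%

# plogis() is base R's built-in version of the same function
all.equal(p_default(5000), plogis(b0 + b1 * 5000))

# A one-dollar increase in balance multiplies the odds by exp(b1)
exp(b1)
```

The last line reads the slope on the odds scale: each extra dollar of balance multiplies the odds of default by `exp(b1)`, whatever the starting balance.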
\[p(X) = \frac{e^{(\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_nX_n)}}{1+e^{\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_nX_n}}\]
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.086905e+01 | 4.922555e-01 | -22.080088 | 4.911280e-108 |
| income | 3.033450e-06 | 8.202615e-06 | 0.369815 | 7.115203e-01 |
| balance | 5.736505e-03 | 2.318945e-04 | 24.737563 | 4.219578e-135 |
| studentYes | -6.467758e-01 | 2.362525e-01 | -2.737646 | 6.188063e-03 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.5041278 | 0.07071301 | -49.554219 | 0.0000000000 |
| studentYes | 0.4048871 | 0.11501883 | 3.520181 | 0.0004312529 |
There is no consensus in the statistics community on a single measure of goodness of fit for logistic regression.
Use the Credit data in {ISLR}.
What you just did is called a stratified binary model.
to Multinomial Logistic Regression
\[Pr(Y=k|X=x) = \frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}\]
for \(k = 1, \dots, K-1\), and
\[Pr(Y=K|X=x) = \frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}\]
\[\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right) = \beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p\]
Which class is treated as reference or baseline is unimportant.
How to interpret this?
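One way to read these formulas is to compute them by hand. The sketch below uses made-up coefficients for \(K = 3\) classes and two predictors (`beta`, `x0` and `multinom_probs` are hypothetical names, not output from any model in these slides). It shows that the \(K\) probabilities sum to 1 and that each linear predictor is the log odds of its class relative to the baseline class.

```r
# Hypothetical coefficients for K = 3 classes and p = 2 predictors.
# Rows are the K - 1 non-baseline classes; columns are (intercept, x1, x2).
beta <- rbind(
  class1 = c(0.5, 1.2, -0.7),
  class2 = c(-1.0, 0.3, 0.9)
)

multinom_probs <- function(beta, x) {
  eta <- as.vector(beta %*% c(1, x))        # linear predictors, classes 1..K-1
  denom <- 1 + sum(exp(eta))
  probs <- c(exp(eta) / denom, 1 / denom)   # last entry is baseline class K
  names(probs) <- c(rownames(beta), "baseline")
  probs
}

x0 <- c(2, 1)
p <- multinom_probs(beta, x0)
p
sum(p)   # the K probabilities sum to 1

# Log odds of class 1 versus the baseline equal the linear predictor
log(p[["class1"]] / p[["baseline"]])
sum(beta["class1", ] * c(1, x0))
```

A coefficient is therefore read relative to the baseline: a one-unit increase in \(x_j\) changes the log odds of class \(k\) versus class \(K\) by \(\beta_{kj}\).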
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
multi_log <- nnet::multinom(
  formula = species ~ body_mass_g + bill_length_mm + bill_depth_mm +
    flipper_length_mm + sex + island,
  data = peng_ref
)
# weights:  27 (16 variable)
initial value 365.837892
iter 10 value 21.914358
iter 20 value 1.629266
iter 30 value 0.026372
final value 0.000049
converged
Call:
nnet::multinom(formula = species ~ body_mass_g + bill_length_mm +
bill_depth_mm + flipper_length_mm + sex + island, data = peng_ref)
Coefficients:
(Intercept) body_mass_g bill_length_mm bill_depth_mm
Adelie 502.6573 -0.08755830 -20.075027 34.82987
Chinstrap -434.3867 -0.02106537 6.332771 -16.48865
flipper_length_mm sexmale islandDream islandTorgersen
Adelie 0.5054518 33.23469 62.03886 144.9809
Chinstrap 1.7645190 -55.22699 335.85058 63.1425
Std. Errors:
(Intercept) body_mass_g bill_length_mm bill_depth_mm
Adelie 0.5314853 2.351402 29.93540 5.286822
Chinstrap 0.5310960 4.080649 29.91681 5.278463
flipper_length_mm sexmale islandDream islandTorgersen
Adelie 49.88305 0.2294146 0.531096 4.701009e-47
Chinstrap 49.81079 0.2290253 0.531096 4.261135e-130
Residual Deviance: 9.874339e-05
AIC: 32.0001
# calculate z-statistics of coefficients
z_stats <- summary(multi_log)$coefficients/
summary(multi_log)$standard.errors
# convert to p-values
p_values <- (1 - pnorm(abs(z_stats)))*2
# display p-values in transposed data frame
data.frame(t(p_values))
                        Adelie   Chinstrap
(Intercept) 0.000000e+00 0.000000000
body_mass_g 9.702963e-01 0.995881131
bill_length_mm 5.024680e-01 0.832357200
bill_depth_mm 4.456258e-11 0.001785562
flipper_length_mm 9.919154e-01 0.971741303
sexmale 0.000000e+00 0.000000000
islandDream 0.000000e+00 0.000000000
islandTorgersen 0.000000e+00 0.000000000
Gentoo Adelie Chinstrap
1 1.565008e-135 1.000000e+00 1.009721e-242
2 3.833780e-97 1.000000e+00 1.450741e-166
3 3.913549e-122 1.000000e+00 1.006490e-181
5 3.854489e-165 1.000000e+00 2.652195e-247
6 2.628864e-168 1.000000e+00 9.671388e-281
7 5.558841e-114 1.000000e+00 3.782674e-190
8 3.576898e-116 1.000000e+00 2.880335e-227
13 3.717985e-108 1.000000e+00 3.470609e-172
14 5.313520e-178 1.000000e+00 2.906555e-297
15 4.397228e-190 1.000000e+00 9.358591e-320
16 4.638659e-132 1.000000e+00 3.570151e-212
17 1.323762e-143 1.000000e+00 1.383387e-216
18 3.906700e-111 1.000000e+00 6.765722e-218
19 2.342237e-174 1.000000e+00 3.742224e-262
20 1.803219e-103 1.000000e+00 6.877397e-206
21 3.440325e-75 1.000000e+00 1.084873e-188
22 2.942167e-90 1.000000e+00 4.087688e-228
23 1.885524e-93 1.000000e+00 8.693837e-211
24 1.304105e-64 1.000000e+00 3.625997e-196
25 2.258588e-50 1.000000e+00 2.714209e-176
26 1.050278e-93 1.000000e+00 4.472482e-212
27 5.070290e-66 1.000000e+00 1.977702e-192
28 4.657299e-56 1.000000e+00 1.773975e-147
29 6.353727e-88 1.000000e+00 1.524263e-202
30 1.459617e-55 1.000000e+00 2.364103e-190
31 1.084357e-69 1.000000e+00 9.173878e-17
32 1.227459e-100 1.000000e+00 5.420913e-94
33 1.265506e-86 1.000000e+00 2.282452e-34
34 8.494378e-82 1.000000e+00 4.163335e-66
35 3.892077e-102 1.000000e+00 1.530044e-47
36 5.034347e-123 1.000000e+00 7.437468e-121
37 3.676963e-116 1.000000e+00 5.542033e-110
38 2.071518e-62 1.000000e+00 3.696922e-16
39 2.421798e-124 1.000000e+00 2.037458e-93
40 6.809138e-66 1.000000e+00 1.601284e-61
41 3.425600e-120 1.000000e+00 7.624079e-81
42 1.606314e-77 1.000000e+00 4.276671e-50
43 6.806524e-135 1.000000e+00 5.590874e-97
44 1.276361e-49 1.000000e+00 3.090448e-26
45 1.481080e-105 1.000000e+00 2.763406e-53
46 2.564070e-66 1.000000e+00 2.715698e-55
47 3.442093e-99 1.000000e+00 7.482183e-81
49 2.187000e-113 1.000000e+00 2.595632e-71
50 2.061308e-96 1.000000e+00 2.896132e-90
51 2.979432e-49 1.000000e+00 3.169427e-145
52 1.697699e-47 1.000000e+00 1.852417e-180
53 3.668664e-95 1.000000e+00 1.072902e-201
54 3.787966e-52 1.000000e+00 1.067250e-172
55 8.398157e-123 1.000000e+00 2.068522e-229
56 4.242336e-55 1.000000e+00 1.503984e-174
57 1.477876e-49 1.000000e+00 3.319481e-146
58 9.797629e-62 1.000000e+00 3.358364e-184
59 2.927841e-83 1.000000e+00 9.109545e-178
60 1.502988e-94 1.000000e+00 3.440099e-226
61 3.045152e-84 1.000000e+00 8.887102e-183
62 4.785268e-68 1.000000e+00 5.169778e-209
63 4.444320e-52 1.000000e+00 3.202953e-150
64 1.419865e-38 1.000000e+00 2.020648e-158
65 2.360178e-92 1.000000e+00 2.038745e-188
66 5.421996e-35 1.000000e+00 4.069958e-151
67 5.476171e-70 1.000000e+00 3.160205e-158
68 2.078893e-49 1.000000e+00 3.188062e-179
69 7.954905e-146 1.000000e+00 1.710102e-209
70 1.077474e-99 1.000000e+00 7.562552e-198
71 3.865774e-182 1.000000e+00 1.261787e-274
72 4.918193e-122 1.000000e+00 6.676413e-220
73 6.012949e-105 1.000000e+00 1.034230e-162
74 1.910730e-68 1.000000e+00 4.867732e-150
75 3.284153e-138 1.000000e+00 2.276590e-215
76 2.618102e-84 1.000000e+00 9.775178e-174
77 9.229092e-81 1.000000e+00 2.732396e-137
78 1.219806e-157 1.000000e+00 3.841097e-274
79 5.636064e-116 1.000000e+00 4.127281e-182
80 5.403569e-109 1.000000e+00 2.345329e-202
81 2.595345e-160 1.000000e+00 5.445574e-234
82 6.249382e-53 1.000000e+00 5.461387e-139
83 5.962076e-143 1.000000e+00 2.474997e-229
84 1.620648e-166 1.000000e+00 1.214968e-284
85 1.460100e-104 1.000000e+00 1.627128e-56
86 5.435650e-115 1.000000e+00 2.320514e-97
87 4.252100e-136 1.000000e+00 7.652729e-132
88 5.233585e-114 1.000000e+00 1.076593e-75
89 3.364375e-108 1.000000e+00 1.960486e-98
90 5.170773e-96 1.000000e+00 8.844224e-54
91 2.399353e-116 1.000000e+00 1.564235e-67
92 2.369104e-57 1.000000e+00 5.984160e-23
93 1.586023e-119 1.000000e+00 1.344923e-80
94 1.484469e-60 1.000000e+00 3.282288e-46
95 1.300253e-107 1.000000e+00 1.283398e-61
96 9.985286e-73 1.000000e+00 1.402530e-42
97 3.686971e-96 1.000000e+00 1.308798e-55
98 1.683356e-66 1.000000e+00 1.620603e-44
99 1.008476e-129 1.000000e+00 6.726151e-87
100 7.608280e-50 1.000000e+00 1.154731e-20
101 3.825182e-85 1.000000e+00 1.162753e-192
102 2.022024e-43 1.000000e+00 3.536458e-174
103 1.322537e-55 1.000000e+00 4.846989e-143
104 1.577519e-86 1.000000e+00 1.054969e-231
105 4.338556e-101 1.000000e+00 1.474158e-197
106 1.261728e-78 1.000000e+00 6.837177e-209
107 9.369108e-44 1.000000e+00 3.189705e-131
108 2.378118e-96 1.000000e+00 3.188707e-237
109 5.296813e-63 1.000000e+00 6.022331e-159
110 6.780156e-06 9.999932e-01 1.699143e-128
111 1.872526e-34 1.000000e+00 9.761655e-120
112 5.672252e-10 1.000000e+00 2.800928e-138
113 2.520868e-61 1.000000e+00 6.490000e-149
114 3.439657e-41 1.000000e+00 1.510222e-165
115 1.613357e-80 1.000000e+00 8.393870e-198
116 4.589386e-26 1.000000e+00 2.167375e-139
117 1.333710e-133 1.000000e+00 7.219327e-193
118 1.875383e-181 1.000000e+00 6.421625e-293
119 5.416429e-142 1.000000e+00 1.382857e-212
120 1.686160e-134 1.000000e+00 1.870521e-225
121 7.970359e-148 1.000000e+00 3.536496e-218
122 1.292156e-177 1.000000e+00 3.222353e-281
123 4.198999e-96 1.000000e+00 3.384342e-165
124 2.604573e-112 1.000000e+00 8.559222e-197
125 1.837631e-140 1.000000e+00 4.157304e-204
126 1.947839e-121 1.000000e+00 3.828437e-215
127 2.478227e-127 1.000000e+00 1.775509e-191
128 1.024613e-90 1.000000e+00 9.595811e-183
129 1.396937e-126 1.000000e+00 1.546359e-184
130 3.280699e-78 1.000000e+00 1.061450e-146
131 2.298752e-132 1.000000e+00 1.046066e-200
132 3.065056e-121 1.000000e+00 1.841132e-206
133 3.030640e-114 1.000000e+00 2.000481e-72
134 8.112726e-87 1.000000e+00 2.224610e-71
135 7.842629e-91 1.000000e+00 6.644925e-43
136 3.409778e-60 1.000000e+00 2.490694e-29
137 1.683988e-121 1.000000e+00 2.224177e-74
138 1.033267e-106 1.000000e+00 5.770265e-90
139 2.701357e-84 1.000000e+00 8.079601e-33
140 8.450468e-66 1.000000e+00 1.488202e-42
141 3.152874e-67 9.999593e-01 4.068580e-05
142 1.617609e-75 1.000000e+00 2.721354e-42
143 7.409834e-126 1.000000e+00 3.397055e-75
144 8.994604e-63 1.000000e+00 7.923681e-28
145 5.783065e-103 1.000000e+00 8.678595e-44
146 4.605532e-105 1.000000e+00 4.108604e-90
147 4.342044e-80 1.000000e+00 1.572962e-65
148 1.885663e-113 1.000000e+00 3.916754e-78
149 5.687431e-113 1.000000e+00 2.382362e-66
150 2.105977e-104 1.000000e+00 3.060488e-83
151 4.036940e-91 1.000000e+00 6.654021e-48
152 1.912664e-70 1.000000e+00 3.972737e-38
153 1.000000e+00 1.772889e-109 1.371973e-36
154 1.000000e+00 1.293197e-123 1.826463e-68
155 1.000000e+00 7.520028e-117 3.422658e-36
156 1.000000e+00 6.893219e-143 8.765637e-70
157 1.000000e+00 8.385801e-122 6.314732e-71
158 1.000000e+00 1.507993e-110 7.335537e-39
159 1.000000e+00 1.320615e-93 2.768662e-51
160 1.000000e+00 2.262479e-93 3.099923e-74
161 1.000000e+00 1.120134e-78 2.435404e-46
162 1.000000e+00 1.043816e-91 2.769695e-77
163 1.000000e+00 1.266441e-61 1.520768e-53
164 1.000000e+00 2.726199e-115 3.866374e-79
165 1.000000e+00 9.945072e-102 6.813724e-41
166 1.000000e+00 8.140272e-145 4.315103e-75
167 1.000000e+00 1.696385e-74 1.841659e-45
168 1.000000e+00 3.811851e-135 1.988430e-77
169 1.000000e+00 4.187589e-56 1.407891e-47
170 1.000000e+00 4.529948e-158 3.567558e-75
171 1.000000e+00 1.564320e-102 6.698092e-50
172 1.000000e+00 7.030489e-119 2.242786e-66
173 1.000000e+00 3.026554e-158 8.663190e-63
174 1.000000e+00 3.137029e-99 3.705486e-50
175 1.000000e+00 4.648131e-89 2.375477e-42
176 1.000000e+00 1.702306e-77 1.311517e-80
177 1.000000e+00 3.163518e-101 3.495546e-46
178 1.000000e+00 3.054452e-88 1.327207e-76
180 1.000000e+00 1.724005e-125 3.039510e-76
181 1.000000e+00 3.604353e-115 2.263329e-40
182 1.000000e+00 3.118941e-135 1.353999e-67
183 1.000000e+00 7.598731e-100 9.617386e-71
184 1.000000e+00 1.264204e-73 3.452318e-56
185 1.000000e+00 6.903902e-103 9.568626e-57
186 1.000000e+00 4.939525e-210 2.816545e-50
187 1.000000e+00 3.582723e-134 7.603848e-39
188 1.000000e+00 1.878878e-100 8.760223e-78
189 1.000000e+00 4.506500e-88 2.221559e-52
190 1.000000e+00 5.736024e-45 2.434454e-95
191 1.000000e+00 4.502070e-80 3.721294e-46
192 1.000000e+00 7.073412e-113 2.117083e-81
193 1.000000e+00 5.134575e-52 8.682756e-47
194 1.000000e+00 9.195023e-126 3.007258e-71
195 1.000000e+00 1.487214e-87 2.630540e-41
196 1.000000e+00 9.689026e-107 2.712435e-62
197 1.000000e+00 4.463268e-130 5.531538e-69
198 1.000000e+00 5.495591e-91 1.539825e-47
199 1.000000e+00 1.805392e-82 2.836442e-41
200 1.000000e+00 1.028279e-123 2.594745e-65
201 1.000000e+00 7.028607e-120 1.460173e-44
202 1.000000e+00 2.064603e-77 6.387252e-86
203 1.000000e+00 3.070581e-112 2.416780e-46
204 1.000000e+00 8.443716e-131 7.699229e-61
205 1.000000e+00 5.034473e-79 8.759494e-48
206 1.000000e+00 1.247832e-118 2.619672e-56
207 1.000000e+00 1.046291e-108 3.826834e-43
208 1.000000e+00 4.093166e-71 1.731317e-77
209 1.000000e+00 6.860213e-72 2.136910e-48
210 1.000000e+00 1.269283e-79 8.616304e-73
211 1.000000e+00 2.750180e-63 1.025108e-55
212 1.000000e+00 7.667920e-138 1.981604e-63
213 1.000000e+00 1.118393e-82 1.219461e-42
214 1.000000e+00 1.993765e-98 3.966097e-72
215 1.000000e+00 6.105472e-91 1.731410e-39
216 1.000000e+00 4.647250e-168 4.056522e-51
217 1.000000e+00 1.385310e-97 2.832574e-40
218 1.000000e+00 2.621620e-114 1.352348e-72
220 1.000000e+00 8.632477e-125 8.344145e-71
221 1.000000e+00 2.591495e-77 7.813423e-46
222 1.000000e+00 3.248586e-145 3.191987e-61
223 1.000000e+00 1.311400e-104 1.558088e-43
224 1.000000e+00 3.566976e-78 7.591760e-74
225 1.000000e+00 1.138779e-97 8.241421e-70
226 1.000000e+00 4.596155e-114 9.416621e-49
227 1.000000e+00 2.254524e-91 1.187537e-46
228 1.000000e+00 9.484012e-120 4.411945e-71
229 1.000000e+00 8.464504e-111 2.395072e-42
230 1.000000e+00 8.286447e-147 7.570869e-76
231 1.000000e+00 3.488978e-101 1.392094e-42
232 1.000000e+00 2.690618e-91 4.928889e-90
233 1.000000e+00 1.674415e-120 5.032080e-38
234 1.000000e+00 1.810814e-148 3.469254e-61
235 1.000000e+00 5.692476e-108 2.484898e-44
236 1.000000e+00 1.130152e-117 5.371048e-67
237 1.000000e+00 3.160061e-99 1.046213e-45
238 1.000000e+00 4.235316e-112 4.820883e-74
239 1.000000e+00 4.723465e-70 3.696940e-48
240 1.000000e+00 3.876329e-154 2.180304e-55
241 1.000000e+00 1.269662e-123 3.931959e-41
242 1.000000e+00 1.245313e-125 2.493731e-66
243 1.000000e+00 4.957267e-110 2.215446e-44
244 1.000000e+00 1.002205e-119 6.243525e-67
245 1.000000e+00 7.196249e-94 4.540894e-49
246 1.000000e+00 1.071104e-121 1.507151e-72
247 1.000000e+00 1.726970e-85 1.237235e-52
248 1.000000e+00 1.570425e-121 1.850985e-60
249 1.000000e+00 1.501573e-99 3.577315e-70
250 1.000000e+00 4.034356e-107 2.046856e-39
251 1.000000e+00 6.894741e-118 3.942318e-46
252 1.000000e+00 3.637413e-114 1.380370e-66
253 1.000000e+00 9.974536e-115 5.983107e-40
254 1.000000e+00 4.214496e-161 7.208930e-58
255 1.000000e+00 1.839915e-101 2.583646e-51
256 1.000000e+00 2.884899e-128 2.469645e-61
258 1.000000e+00 1.986041e-94 1.689584e-85
259 1.000000e+00 2.984444e-56 4.996325e-62
260 1.000000e+00 1.247952e-155 3.919568e-62
261 1.000000e+00 1.782553e-76 5.280700e-53
262 1.000000e+00 3.314837e-122 2.323582e-79
263 1.000000e+00 1.677925e-135 1.493023e-39
264 1.000000e+00 1.198820e-137 3.330126e-69
265 1.000000e+00 8.029180e-62 6.684889e-58
266 1.000000e+00 4.356844e-129 1.647240e-62
267 1.000000e+00 1.150705e-90 5.117224e-37
268 1.000000e+00 2.544552e-178 1.158961e-53
270 1.000000e+00 7.893047e-128 6.341904e-80
271 1.000000e+00 5.236358e-127 9.840445e-39
273 1.000000e+00 2.258098e-111 1.118938e-42
274 1.000000e+00 7.780342e-140 1.175677e-69
275 1.000000e+00 7.922007e-104 3.689074e-56
276 1.000000e+00 4.308239e-118 1.367370e-77
277 9.383743e-73 4.229444e-53 1.000000e+00
278 2.414257e-46 6.692323e-33 1.000000e+00
279 4.687592e-52 1.229642e-45 1.000000e+00
280 1.048175e-60 3.444186e-21 1.000000e+00
281 5.469722e-54 1.129503e-52 1.000000e+00
282 2.241766e-70 1.074536e-56 1.000000e+00
283 4.593230e-61 5.951872e-27 1.000000e+00
284 2.288727e-61 5.338912e-73 1.000000e+00
285 1.432389e-60 1.726689e-45 1.000000e+00
286 2.039038e-50 3.258510e-34 1.000000e+00
287 9.109742e-72 1.098026e-65 1.000000e+00
288 6.685114e-45 7.275545e-30 1.000000e+00
289 3.123457e-71 3.729335e-74 1.000000e+00
290 2.497980e-64 4.169730e-94 1.000000e+00
291 1.295953e-74 4.031413e-65 1.000000e+00
292 1.838294e-49 1.795872e-43 1.000000e+00
293 7.633018e-50 2.033881e-08 1.000000e+00
294 7.713006e-95 5.573962e-187 1.000000e+00
295 2.164057e-66 8.161922e-34 1.000000e+00
296 4.115408e-48 1.364831e-66 1.000000e+00
297 1.978714e-56 2.528745e-16 1.000000e+00
298 2.777610e-57 4.234493e-43 1.000000e+00
299 1.206631e-74 3.632233e-24 1.000000e+00
300 2.515493e-47 1.752889e-37 1.000000e+00
301 1.966260e-77 2.935052e-51 1.000000e+00
302 6.646101e-54 9.510346e-75 1.000000e+00
303 3.208245e-87 2.559884e-89 1.000000e+00
304 1.575148e-52 1.308607e-37 1.000000e+00
305 1.340750e-70 2.068844e-59 1.000000e+00
306 2.051777e-51 1.462469e-77 1.000000e+00
307 1.418427e-65 1.883947e-06 9.999981e-01
308 9.302754e-49 2.214646e-66 1.000000e+00
309 6.913968e-68 6.639889e-27 1.000000e+00
310 1.217067e-57 1.420780e-69 1.000000e+00
311 6.092297e-54 2.616066e-40 1.000000e+00
312 4.367672e-85 1.831430e-104 1.000000e+00
313 5.181682e-72 1.506746e-68 1.000000e+00
314 9.562039e-46 9.723739e-63 1.000000e+00
315 1.754656e-90 1.469712e-63 1.000000e+00
316 1.634584e-54 2.249546e-86 1.000000e+00
317 7.277925e-54 1.567397e-30 1.000000e+00
318 1.370821e-69 3.584048e-60 1.000000e+00
319 6.936680e-55 4.964565e-42 1.000000e+00
320 1.631335e-79 7.066499e-64 1.000000e+00
321 2.551698e-86 8.374298e-111 1.000000e+00
322 1.666233e-54 5.578999e-83 1.000000e+00
323 4.887942e-82 2.089817e-90 1.000000e+00
324 1.767910e-51 1.671757e-39 1.000000e+00
325 3.012468e-55 3.049238e-44 1.000000e+00
326 4.008165e-89 1.181851e-112 1.000000e+00
327 7.332885e-95 1.178048e-103 1.000000e+00
328 3.782168e-57 2.803699e-64 1.000000e+00
329 1.058241e-74 9.870156e-61 1.000000e+00
330 7.903304e-51 1.246369e-44 1.000000e+00
331 1.368628e-63 1.565206e-13 1.000000e+00
332 2.730994e-62 2.762055e-61 1.000000e+00
333 5.220568e-80 2.129608e-59 1.000000e+00
334 1.514945e-45 4.068005e-24 1.000000e+00
335 2.029049e-57 3.448219e-51 1.000000e+00
336 7.675421e-61 3.661378e-11 1.000000e+00
337 8.942477e-59 1.327385e-61 1.000000e+00
338 6.212078e-80 1.959542e-90 1.000000e+00
339 6.325469e-78 5.897002e-70 1.000000e+00
340 1.160711e-67 1.231273e-101 1.000000e+00
341 1.194972e-71 8.123202e-17 1.000000e+00
342 2.133439e-53 4.893635e-52 1.000000e+00
343 5.050237e-61 1.191538e-66 1.000000e+00
344 2.773257e-79 6.303744e-89 1.000000e+00
Use the Publication data from ISLR2.
Split data into 80%-20% training and test set randomly.
Fit a multinomial logistic model to classify the variable mech.
Use the test data to predict the mech variable and judge whether the fit is reasonable.
What test error rate did you get in the previous exercise?
\[Ave(I(y_0 \neq \hat y_0))\]
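The formula above is the share of test observations that are misclassified. Below is a minimal sketch on simulated two-class data. The data, the 0.5 threshold and all object names are assumptions for illustration, and this is not a solution to the exercise.

```r
set.seed(1)

# Simulated two-class data (hypothetical; stands in for any labelled data set)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))
dat <- data.frame(x = x, y = y)

# Random 80%-20% split into training and test sets
train_id <- sample(n, size = 0.8 * n)
train <- dat[train_id, ]
test  <- dat[-train_id, ]

# Fit on the training set only
fit <- glm(y ~ x, data = train, family = binomial)

# Classify test observations with a 0.5 probability threshold
p_hat <- predict(fit, newdata = test, type = "response")
y_hat <- as.integer(p_hat > 0.5)

# Test error rate: average of the indicator I(y0 != y0_hat)
test_error <- mean(test$y != y_hat)
test_error

# Confusion matrix of predicted versus observed classes
table(predicted = y_hat, observed = test$y)
```

The same `mean(observed != predicted)` line works for a multinomial model once the predicted class labels are available.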
“a classifier that assigns each observation to the most likely class, given its predictor values” minimizes the test error rate.
This lowest possible error rate is called the Bayes error rate.
Bayes Decision Boundary
Why not always use Bayes Classifier?
Keep in mind the good old Bayes Rule
\[P(A|B) = \frac{P(B|A)* P(A)}{P(B)}\]
\(\pi_k\) is the overall (prior) probability that a randomly chosen observation belongs to the \(k^{th}\) class of the response.
\(f_k(X) = Pr(X|Y=k)\)
\[Pr(Y=k|X=x) = \frac{\pi_k*f_k(x)}{\sum_{l=1}^K\pi_lf_l(x)}\]
We are trying to approximate the Bayes classifier! We will explore linear discriminant analysis, quadratic discriminant analysis and naive Bayes.
The overarching goal is to estimate \(f_k(x)\).
To achieve our goal, we assume that \(f_k(x)\) is normal.
\[f_k(x) = \frac{1}{\sigma_k\sqrt{2\pi}}exp(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)\]
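Here, \(\mu_k\) and \(\sigma_k^2\) are the mean and variance parameters of the \(k^{th}\) class.
We also assume that \(\sigma_1^2 = \dots = \sigma_K^2 = \sigma^2\).
\[ Pr(Y=k|X=x) = \frac{\pi_k*\frac{1}{\sigma\sqrt{2\pi}}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K\pi_l\frac{1}{\sigma\sqrt{2\pi}}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)} \]
Taking the log and dropping the terms that do not depend on \(k\), an observation is assigned to the class for which the following score is largest:
\[ \delta_k(x) = x.\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2} + log(\pi_k) \]
With \(K = 2\) and \(\pi_1 = \pi_2\), the decision boundary is the point where the two scores are equal:
\[ x = \frac{\mu_1^2-\mu_2^2}{2(\mu_1-\mu_2)}= \frac{\mu_1 + \mu_2}{2} \]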
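The sketch below checks these expressions numerically for one predictor and two classes. The means, standard deviation and priors are hypothetical, and `delta` is simply the linear score in \(x\) written above. It is meant to show that the posterior from Bayes' theorem and the linear score pick the same class, and that with equal priors the boundary sits midway between the two means.

```r
# Hypothetical parameters for K = 2 classes with a shared variance
mu    <- c(-1, 2)      # class means mu_1, mu_2
sigma <- 1.5           # common standard deviation
prior <- c(0.5, 0.5)   # pi_1, pi_2

# Posterior Pr(Y = k | X = x) from Bayes' theorem with normal densities
posterior <- function(x) {
  num <- prior * dnorm(x, mean = mu, sd = sigma)
  num / sum(num)
}

# Linear score: x * mu_k / sigma^2 - mu_k^2 / (2 * sigma^2) + log(pi_k)
delta <- function(x) {
  x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(prior)
}

x0 <- 1.2
posterior(x0)
which.max(posterior(x0)) == which.max(delta(x0))   # same class either way

# With equal priors the boundary is the midpoint of the two means
boundary <- (mu[1] + mu[2]) / 2
boundary
posterior(boundary)   # both classes equally likely at the boundary
```

In practice \(\mu_k\), \(\sigma^2\) and \(\pi_k\) are unknown and are estimated from the training data.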
lda_default_balance_student <-
MASS::lda(default ~ balance + student, data = Default)
lda_default_balance_student
Call:
lda(default ~ balance + student, data = Default)
Prior probabilities of groups:
No Yes
0.9667 0.0333
Group means:
balance studentYes
No 803.9438 0.2914037
Yes 1747.8217 0.3813814
Coefficients of linear discriminants:
LD1
balance 0.002244397
studentYes -0.249059498
training error rate
trivial null classifier
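A small sketch of the benchmark that the trivial null classifier provides, using only the class proportions printed in the LDA output above. The object names are illustrative.

```r
# Class proportions copied from the "Prior probabilities of groups" output above
prior <- c(No = 0.9667, Yes = 0.0333)

# A trivial null classifier always predicts the most common class
null_prediction <- names(which.max(prior))
null_prediction

# It misclassifies every observation from the other class,
# so its error rate equals the share of those observations
null_error <- 1 - max(prior)
null_error
```

Because defaults are rare, a training error rate only looks good if it is clearly below this value.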
See the OJ data set in ISLR2
Use this data set to predict the variable Purchase.
Split the data 80/20 into training and test sets.
Use the training data to develop an LDA model. Use the ROC curve and the confusion matrix to gauge model effectiveness. Fine-tune the model. See chapter 9 of TMWR.
Predict the test data with the fine-tuned model.
Quadratic Discriminant Analysis
See the Smarket data in ISLR2.
Split in 80/20 training and testing.
Train LDA and QDA models.
Test these models and compare results - use test error rate.
What happens if you take n number of training data sets and n number of testing data sets, run LDA and QDA on each pair and plot training error rate and testing error rate distributions?
\[f_k(x) = f_{k1}(x_1)*f_{k2}(x_2)*...*f_{kp}(x_p)\]
\[Pr(Y=k|X=x) = \frac{\pi_k*f_{k1}(x_1)*f_{k2}(x_2)*...*f_{kp}(x_p)}{\sum_{l=1}^K \pi_l*f_{l1}(x_1)*f_{l2}(x_2)*...*f_{lp}(x_p)}\]
How is \(f_{kj}\) estimated?
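One common choice for a quantitative predictor is to assume it is normally distributed within each class, so each one-dimensional density needs only a class-specific mean and standard deviation. Alternatives include a histogram or kernel density estimate, or class-wise proportions for a qualitative predictor. The sketch below works the Gaussian version by hand on the built-in `iris` data. The data set and all object names are illustrative choices, not part of the original slides.

```r
# Gaussian naive Bayes by hand on the built-in iris data (illustrative choice)
X <- iris[, 1:4]
y <- iris$Species
classes <- levels(y)

# pi_k: class proportions in the training data
prior <- sapply(classes, function(k) mean(y == k))

# Per-class, per-predictor means and standard deviations define each f_kj
mu  <- sapply(classes, function(k) colMeans(X[y == k, ]))
sds <- sapply(classes, function(k) apply(X[y == k, ], 2, sd))

# Posterior for one observation x (a numeric vector of length p)
nb_posterior <- function(x) {
  num <- sapply(classes, function(k) {
    # product of the p one-dimensional densities, times the prior
    prior[[k]] * prod(dnorm(x, mean = mu[, k], sd = sds[, k]))
  })
  num / sum(num)
}

x0 <- unlist(X[1, ])
post <- nb_posterior(x0)
round(post, 4)
names(which.max(post))   # predicted class for the first flower
```

Only \(p\) means and \(p\) standard deviations are estimated per class, with no covariances. That is the simplification the independence assumption buys.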
Use the naiveBayes() function from the {e1071} package.
Use the Smarket data and compare the results with QDA.
What method to use when the response is numeric but always takes the values of a non-negative integer?
Data: Bikeshare
Rows: 8,645
Columns: 15
$ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ mnth <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan,…
$ day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hr <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weekday <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weathersit <fct> clear, clear, clear, clear, clear, cloudy/misty, clear, cle…
$ temp <dbl> 0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32,…
$ atemp <dbl> 0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576, 0.2727, 0.2…
$ hum <dbl> 0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76,…
$ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0…
$ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
$ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
$ bikers <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
Call:
lm(formula = bikers ~ workingday + temp + weathersit + mnth +
hr, data = Bikeshare)
Residuals:
Min 1Q Median 3Q Max
-299.00 -45.70 -6.23 41.08 425.29
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -68.632 5.307 -12.932 < 2e-16 ***
workingday 1.270 1.784 0.711 0.476810
temp 157.209 10.261 15.321 < 2e-16 ***
weathersitcloudy/misty -12.890 1.964 -6.562 5.60e-11 ***
weathersitlight rain/snow -66.494 2.965 -22.425 < 2e-16 ***
weathersitheavy rain/snow -109.745 76.667 -1.431 0.152341
mnthFeb 6.845 4.287 1.597 0.110398
mnthMarch 16.551 4.301 3.848 0.000120 ***
mnthApril 41.425 4.972 8.331 < 2e-16 ***
mnthMay 72.557 5.641 12.862 < 2e-16 ***
mnthJune 67.819 6.544 10.364 < 2e-16 ***
mnthJuly 45.324 7.081 6.401 1.63e-10 ***
mnthAug 53.243 6.640 8.019 1.21e-15 ***
mnthSept 66.678 5.925 11.254 < 2e-16 ***
mnthOct 75.834 4.950 15.319 < 2e-16 ***
mnthNov 60.310 4.610 13.083 < 2e-16 ***
mnthDec 46.458 4.271 10.878 < 2e-16 ***
hr1 -14.579 5.699 -2.558 0.010536 *
hr2 -21.579 5.733 -3.764 0.000168 ***
hr3 -31.141 5.778 -5.389 7.26e-08 ***
hr4 -36.908 5.802 -6.361 2.11e-10 ***
hr5 -24.135 5.737 -4.207 2.61e-05 ***
hr6 20.600 5.704 3.612 0.000306 ***
hr7 120.093 5.693 21.095 < 2e-16 ***
hr8 223.662 5.690 39.310 < 2e-16 ***
hr9 120.582 5.693 21.182 < 2e-16 ***
hr10 83.801 5.705 14.689 < 2e-16 ***
hr11 105.423 5.722 18.424 < 2e-16 ***
hr12 137.284 5.740 23.916 < 2e-16 ***
hr13 136.036 5.760 23.617 < 2e-16 ***
hr14 126.636 5.776 21.923 < 2e-16 ***
hr15 132.087 5.780 22.852 < 2e-16 ***
hr16 178.521 5.772 30.927 < 2e-16 ***
hr17 296.267 5.749 51.537 < 2e-16 ***
hr18 269.441 5.736 46.976 < 2e-16 ***
hr19 186.256 5.714 32.596 < 2e-16 ***
hr20 125.549 5.704 22.012 < 2e-16 ***
hr21 87.554 5.693 15.378 < 2e-16 ***
hr22 59.123 5.689 10.392 < 2e-16 ***
hr23 26.838 5.688 4.719 2.41e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared: 0.6745, Adjusted R-squared: 0.6731
F-statistic: 457.3 on 39 and 8605 DF, p-value: < 2.2e-16
What if we adjust for non-constant variance of \(\epsilon\) with \(Y\)?
Call:
lm(formula = log(bikers) ~ workingday + temp + weathersit + mnth +
hr, data = Bikeshare)
Residuals:
Min 1Q Median 3Q Max
-4.2919 -0.3038 0.0450 0.3807 2.5641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.40308 0.04404 54.563 < 2e-16 ***
workingday -0.02036 0.01481 -1.375 0.169169
temp 1.05865 0.08516 12.432 < 2e-16 ***
weathersitcloudy/misty -0.05990 0.01630 -3.674 0.000240 ***
weathersitlight rain/snow -0.68523 0.02461 -27.845 < 2e-16 ***
weathersitheavy rain/snow -0.79376 0.63626 -1.248 0.212235
mnthFeb 0.23106 0.03558 6.494 8.84e-11 ***
mnthMarch 0.30883 0.03570 8.652 < 2e-16 ***
mnthApril 0.63591 0.04127 15.410 < 2e-16 ***
mnthMay 0.91154 0.04682 19.470 < 2e-16 ***
mnthJune 0.85752 0.05431 15.791 < 2e-16 ***
mnthJuly 0.76458 0.05877 13.010 < 2e-16 ***
mnthAug 0.77030 0.05510 13.979 < 2e-16 ***
mnthSept 0.85967 0.04917 17.483 < 2e-16 ***
mnthOct 0.91447 0.04108 22.259 < 2e-16 ***
mnthNov 0.80497 0.03826 21.041 < 2e-16 ***
mnthDec 0.63938 0.03544 18.040 < 2e-16 ***
hr1 -0.61508 0.04729 -13.005 < 2e-16 ***
hr2 -1.11341 0.04758 -23.402 < 2e-16 ***
hr3 -1.68041 0.04795 -35.042 < 2e-16 ***
hr4 -1.99993 0.04815 -41.532 < 2e-16 ***
hr5 -1.05245 0.04761 -22.106 < 2e-16 ***
hr6 0.18048 0.04734 3.813 0.000138 ***
hr7 1.14734 0.04725 24.285 < 2e-16 ***
hr8 1.81391 0.04722 38.415 < 2e-16 ***
hr9 1.53239 0.04724 32.436 < 2e-16 ***
hr10 1.22379 0.04735 25.847 < 2e-16 ***
hr11 1.34852 0.04749 28.397 < 2e-16 ***
hr12 1.53880 0.04764 32.302 < 2e-16 ***
hr13 1.53233 0.04780 32.055 < 2e-16 ***
hr14 1.46830 0.04794 30.629 < 2e-16 ***
hr15 1.50923 0.04797 31.463 < 2e-16 ***
hr16 1.76166 0.04790 36.775 < 2e-16 ***
hr17 2.17604 0.04771 45.612 < 2e-16 ***
hr18 2.08322 0.04760 43.765 < 2e-16 ***
hr19 1.79162 0.04742 37.781 < 2e-16 ***
hr20 1.49547 0.04734 31.593 < 2e-16 ***
hr21 1.23022 0.04725 26.036 < 2e-16 ***
hr22 0.99079 0.04722 20.984 < 2e-16 ***
hr23 0.58547 0.04720 12.403 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6349 on 8605 degrees of freedom
Multiple R-squared: 0.8092, Adjusted R-squared: 0.8084
F-statistic: 936 on 39 and 8605 DF, p-value: < 2.2e-16
\[Pr(Y=k) = \frac{e^{-\lambda}\lambda^k}{k!}\]
\(Y \in \{0,1,2,3,4,...\}\)
\(k = 0,1,2,3,4,...\)
\(\lambda > 0\) is the expected value of \(Y\).
\(\lambda = E(Y) = Var(Y)\)
The expected value \(\lambda = \lambda(X_1,...,X_p)\) is modeled as a function of the \(p\) covariates:
\[log(\lambda(X_1,...,X_p)) = \beta_0 + \beta_1X_1+...+\beta_pX_p\] or, equivalently,
\[\lambda(X_1,...,X_p) = e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}\]
\[l(\beta_0,\beta_1,...\beta_p) = \prod_{i=1}^n\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!}\]
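A short sketch tying these formulas to base R. It checks the probability mass function against `dpois()`, illustrates that mean and variance are both close to \(\lambda\), and evaluates the (log) likelihood that `glm()` maximises. All values and data are simulated and hypothetical, and the helper names `pois_pmf` and `loglik` are not from the original slides.

```r
# Poisson probability mass function written out from the formula
pois_pmf <- function(k, lambda) exp(-lambda) * lambda^k / factorial(k)

lambda <- 5   # hypothetical expected count
k <- 0:10
all.equal(pois_pmf(k, lambda), dpois(k, lambda))   # matches base R's dpois()

# Mean and variance of simulated Poisson counts are both close to lambda
set.seed(1)
y <- rpois(1e5, lambda)
c(mean = mean(y), variance = var(y))

# Log-likelihood of a Poisson regression with a log link:
# lambda(x_i) = exp(beta0 + beta1 * x_i)
loglik <- function(beta, x, y) {
  lam <- exp(beta[1] + beta[2] * x)
  sum(dpois(y, lam, log = TRUE))
}

# Simulated data with known (hypothetical) coefficients
x <- runif(200)
y_sim <- rpois(200, exp(0.5 + 1.5 * x))
fit <- glm(y_sim ~ x, family = poisson)

# glm() maximises this likelihood, so its estimates score at least as well
# as the coefficients used to simulate the data
loglik(coef(fit), x, y_sim) >= loglik(c(0.5, 1.5), x, y_sim)
as.numeric(logLik(fit))   # agrees with loglik(coef(fit), x, y_sim)
```

Working with the log of the likelihood turns the product over observations into a sum, which is what `loglik()` computes.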
glm(
bikers ~ workingday + temp + weathersit + mnth + hr,
data = Bikeshare,
family = poisson
)-> bikers_poi
summary(bikers_poi)
Call:
glm(formula = bikers ~ workingday + temp + weathersit + mnth +
hr, family = poisson, data = Bikeshare)
Deviance Residuals:
Min 1Q Median 3Q Max
-20.7574 -3.3441 -0.6549 2.6999 21.9628
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.693688 0.009720 277.124 < 2e-16 ***
workingday 0.014665 0.001955 7.502 6.27e-14 ***
temp 0.785292 0.011475 68.434 < 2e-16 ***
weathersitcloudy/misty -0.075231 0.002179 -34.528 < 2e-16 ***
weathersitlight rain/snow -0.575800 0.004058 -141.905 < 2e-16 ***
weathersitheavy rain/snow -0.926287 0.166782 -5.554 2.79e-08 ***
mnthFeb 0.226046 0.006951 32.521 < 2e-16 ***
mnthMarch 0.376437 0.006691 56.263 < 2e-16 ***
mnthApril 0.691693 0.006987 98.996 < 2e-16 ***
mnthMay 0.910641 0.007436 122.469 < 2e-16 ***
mnthJune 0.893405 0.008242 108.402 < 2e-16 ***
mnthJuly 0.773787 0.008806 87.874 < 2e-16 ***
mnthAug 0.821341 0.008332 98.573 < 2e-16 ***
mnthSept 0.903663 0.007621 118.578 < 2e-16 ***
mnthOct 0.937743 0.006744 139.054 < 2e-16 ***
mnthNov 0.820433 0.006494 126.334 < 2e-16 ***
mnthDec 0.686850 0.006317 108.724 < 2e-16 ***
hr1 -0.471593 0.012999 -36.278 < 2e-16 ***
hr2 -0.808761 0.014646 -55.220 < 2e-16 ***
hr3 -1.443918 0.018843 -76.631 < 2e-16 ***
hr4 -2.076098 0.024796 -83.728 < 2e-16 ***
hr5 -1.060271 0.016075 -65.957 < 2e-16 ***
hr6 0.324498 0.010610 30.585 < 2e-16 ***
hr7 1.329567 0.009056 146.822 < 2e-16 ***
hr8 1.831313 0.008653 211.630 < 2e-16 ***
hr9 1.336155 0.009016 148.191 < 2e-16 ***
hr10 1.091238 0.009261 117.831 < 2e-16 ***
hr11 1.248507 0.009093 137.304 < 2e-16 ***
hr12 1.434028 0.008936 160.486 < 2e-16 ***
hr13 1.427951 0.008951 159.529 < 2e-16 ***
hr14 1.379296 0.008999 153.266 < 2e-16 ***
hr15 1.408149 0.008977 156.862 < 2e-16 ***
hr16 1.628688 0.008805 184.979 < 2e-16 ***
hr17 2.049021 0.008565 239.221 < 2e-16 ***
hr18 1.966668 0.008586 229.065 < 2e-16 ***
hr19 1.668409 0.008743 190.830 < 2e-16 ***
hr20 1.370588 0.008973 152.737 < 2e-16 ***
hr21 1.118568 0.009215 121.383 < 2e-16 ***
hr22 0.871879 0.009536 91.429 < 2e-16 ***
hr23 0.481387 0.010207 47.164 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1052921 on 8644 degrees of freedom
Residual deviance: 228041 on 8605 degrees of freedom
AIC: 281159
Number of Fisher Scoring iterations: 5
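Because the model is linear in \(log(\lambda)\), each coefficient acts multiplicatively on the expected count. The sketch below exponentiates two estimates copied from the output above. The object names are illustrative, and "clear" is read as the baseline weather level because it is the level omitted from the coefficient table.

```r
# Estimates copied from the Poisson regression output above
b_light_rain <- -0.575800   # weathersit: light rain/snow (baseline: clear)
b_workingday <- 0.014665

# A coefficient b multiplies the expected count by exp(b),
# holding the other covariates fixed
exp(b_light_rain)   # expected bikers relative to clear weather
exp(b_workingday)   # working day relative to non-working day

# Equivalent percentage changes in the expected count
pct_change <- 100 * (exp(c(light_rain = b_light_rain,
                           workingday = b_workingday)) - 1)
pct_change
```

With the fitted object itself, `exp(coef(bikers_poi))` gives the same multiplicative effects for every term at once.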